HW5 Michal Grotkowski¶

Task 1.¶

$$f(x_1, x_2) = (x_1 + x_2)^2, \qquad x_1, x_2 \sim \mathcal{U}[-1, 1], \quad x_1 = x_2$$

$$g_1^{PD}(z) = \mathbb{E}_{x_2}\left[(z + x_2)^2\right] = \mathbb{E}_{x_2}\left[z^2 + 2 z x_2 + x_2^2\right] = z^2 + 2z \cdot 0 + \mathbb{E}\left[x_2^2\right]$$

$$\mathbb{E}\left[x_2^2\right] = \int_{-1}^{1} \frac{1}{2} x^2 \, dx = \frac{1}{3}$$

So we get: $g_1^{PD}(z) = z^2 + \dfrac 13 $
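This closed form can be checked with a quick Monte Carlo simulation (a sketch added for illustration, not part of the assignment code):

```python
import numpy as np

# Monte Carlo check of g_1^PD(z) = E_{x2}[(z + x2)^2] for x2 ~ U[-1, 1].
rng = np.random.default_rng(0)
x2 = rng.uniform(-1.0, 1.0, size=1_000_000)

z = 0.5
estimate = np.mean((z + x2) ** 2)
closed_form = z**2 + 1/3

# With a million samples the two values should agree to a few decimals.
print(estimate, closed_form)
```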

Task 2.¶

For this task I trained an XGBoost model on the Adult Income dataset. We will study the following two observations:

In [44]:
X_test.iloc[[100,2]]
Out[44]:
age fnlwgt educational-num capital-gain capital-loss hours-per-week workclass_Federal-gov workclass_Local-gov workclass_Never-worked workclass_Private ... native-country_Portugal native-country_Puerto-Rico native-country_Scotland native-country_South native-country_Taiwan native-country_Thailand native-country_Trinadad&Tobago native-country_United-States native-country_Vietnam native-country_Yugoslavia
784 49 139571 9 4064 0 40 0 0 0 1 ... 0 0 0 0 0 0 0 1 0 0
524 53 139671 10 0 0 50 0 1 0 0 ... 0 0 0 0 0 0 0 1 0 0

2 rows × 100 columns

An important difference between these two observations is that one person works in the private sector while the other is a government worker. Below are the predictions of the XGBoost model, where $1$ means their income will be more than 50K USD.

In [55]:
print("predicted classes: %s and %s" % (model.predict(X_test.iloc[[100]])[0],model.predict(X_test.iloc[[2]])[0]))
print("true classes %s and %s" % (target_test.iloc[100], target_test.iloc[2]))
predicted classes: 0 and 1
true classes 0 and 0

Ceteris Paribus profiles for age and capital-loss for the respective observations are shown below. For the private-sector worker the age profile is mostly flat, while the government worker's profile shows much larger changes. This may indicate that the model gives age less weight when assigning a higher income to workers in the private sector.

The capital-loss curves also differ: for the private-sector worker the influence of the capital-loss value on assigning the high-income class is either neutral or positive, while for the government worker a higher value can also have a negative influence. This is surprising, as we would expect capital-loss to be inversely correlated with higher income, though private-sector workers may take more financial risks. The fluctuating pattern may also be caused by an out-of-distribution problem, since the vast majority of capital-loss values in the dataset are near zero.
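Conceptually, a Ceteris Paribus profile just varies one feature of a single observation over a grid while holding all other features fixed. A minimal sketch of that computation (the `ceteris_paribus` helper below is my own illustration; `dalex` does the equivalent internally in `predict_profile`):

```python
import numpy as np
import pandas as pd

def ceteris_paribus(model, observation, feature, grid):
    # Duplicate the observation once per grid value, overwrite the chosen
    # feature, and record the predicted probability of the positive class.
    rows = []
    for value in grid:
        modified = observation.copy()
        modified[feature] = value
        rows.append(modified)
    frame = pd.DataFrame(rows)
    return pd.Series(model.predict_proba(frame)[:, 1], index=grid)

# e.g. ceteris_paribus(model, X_test.iloc[100], "age", np.arange(17, 91))
```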

In [60]:
cp.plot(variables = ["age", "capital-loss"], title = "Private sector worker")
cp2.plot(variables = ["age", "capital-loss"], title = "Government worker")

I have also trained a Logistic Regression model; below are the Partial Dependence profiles for age. For logistic regression the PDP always traces a logistic curve, with its growth rate reflecting the coefficient associated with age. We can see that, on average, for both models an increase in age raises the chance of the model assigning the high-income class. This relationship is monotonic for logistic regression, while for XGBoost there is a large bump between ages 70 and 80. This may indicate a higher number of observations in the training set in this age bracket that do in fact earn more.
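For reference, the Partial Dependence profile is the Ceteris Paribus idea averaged over the whole dataset: for each grid value of the feature, substitute it into every observation and average the predictions. A minimal sketch of what `model_profile` computes (the `partial_dependence` helper is my own illustration, not the `dalex` API):

```python
import numpy as np

def partial_dependence(model, data, feature, grid):
    # For each grid value, set the feature to that value in every row
    # and average the model's positive-class probabilities.
    averages = []
    for value in grid:
        modified = data.copy()
        modified[feature] = value
        averages.append(model.predict_proba(modified)[:, 1].mean())
    return np.array(averages)

# e.g. partial_dependence(model, X_test, "age", np.linspace(17, 90, 50))
```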

In [61]:
pdp.plot(variables=["age"], geom="profiles", title="Partial Dependence Plot - XGBoost ")
pdp2.plot(variables=["age"], geom="profiles", title="Partial Dependence Plot - Logistic Regression")

Appendix¶

In [59]:
import plotly.io as pio

pio.renderers.default = "jupyterlab"
In [2]:
from google.colab import files
files.upload()
!unzip archive.zip
In [3]:
import pandas as pd
df = pd.read_csv("adult.csv")
df["income"] = pd.get_dummies(df["income"], drop_first = True)
cols=df.select_dtypes(exclude=['int', 'uint8']).columns.to_list()
df[cols] = df[cols].astype("category")

X = df.iloc[:, :-1]
target = df["income"]
X.head()
Out[3]:
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country
0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States
1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States
2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States
3 44 Private 160323 Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States
4 18 ? 103497 Some-college 10 Never-married ? Own-child White Female 0 0 30 United-States
In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, target_train, target_test = train_test_split(pd.get_dummies(X, drop_first = True), target, random_state = 42, train_size = 0.8, stratify = target)
In [5]:
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

linear_model = make_pipeline(StandardScaler(), LogisticRegression())
model = xgb.XGBClassifier()
model.fit(X_train, target_train)

linear_model.fit(X_train, target_train)

print(accuracy_score(target_test, model.predict(X_test)),accuracy_score(target_test, linear_model.predict(X_test)))
0.8754222540689938 0.8537209540382844
In [6]:
!pip install dalex
In [7]:
import dalex as dx

def predict_func(model, df):
    return model.predict_proba(df)[:, 1]

explainer = dx.Explainer(model, X_test, target_test, predict_function = predict_func)
explainer.model_performance()
Preparation of a new explainer is initiated

  -> data              : 9769 rows 100 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9769 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function predict_func at 0x7fde2ee6cb80> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 1.44e-05, mean = 0.237, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.986, mean = 0.00211, max = 1.0
  -> model_info        : package xgboost

A new explainer has been created!
Out[7]:
recall precision f1 accuracy auc
XGBClassifier 0.65355 0.789664 0.715188 0.875422 0.929625
In [19]:
for col in X_test.columns:
    print(col)
age
fnlwgt
educational-num
capital-gain
capital-loss
hours-per-week
workclass_Federal-gov
workclass_Local-gov
workclass_Never-worked
workclass_Private
workclass_Self-emp-inc
workclass_Self-emp-not-inc
workclass_State-gov
workclass_Without-pay
education_11th
education_12th
education_1st-4th
education_5th-6th
education_7th-8th
education_9th
education_Assoc-acdm
education_Assoc-voc
education_Bachelors
education_Doctorate
education_HS-grad
education_Masters
education_Preschool
education_Prof-school
education_Some-college
marital-status_Married-AF-spouse
marital-status_Married-civ-spouse
marital-status_Married-spouse-absent
marital-status_Never-married
marital-status_Separated
marital-status_Widowed
occupation_Adm-clerical
occupation_Armed-Forces
occupation_Craft-repair
occupation_Exec-managerial
occupation_Farming-fishing
occupation_Handlers-cleaners
occupation_Machine-op-inspct
occupation_Other-service
occupation_Priv-house-serv
occupation_Prof-specialty
occupation_Protective-serv
occupation_Sales
occupation_Tech-support
occupation_Transport-moving
relationship_Not-in-family
relationship_Other-relative
relationship_Own-child
relationship_Unmarried
relationship_Wife
race_Asian-Pac-Islander
race_Black
race_Other
race_White
gender_Male
native-country_Cambodia
native-country_Canada
native-country_China
native-country_Columbia
native-country_Cuba
native-country_Dominican-Republic
native-country_Ecuador
native-country_El-Salvador
native-country_England
native-country_France
native-country_Germany
native-country_Greece
native-country_Guatemala
native-country_Haiti
native-country_Holand-Netherlands
native-country_Honduras
native-country_Hong
native-country_Hungary
native-country_India
native-country_Iran
native-country_Ireland
native-country_Italy
native-country_Jamaica
native-country_Japan
native-country_Laos
native-country_Mexico
native-country_Nicaragua
native-country_Outlying-US(Guam-USVI-etc)
native-country_Peru
native-country_Philippines
native-country_Poland
native-country_Portugal
native-country_Puerto-Rico
native-country_Scotland
native-country_South
native-country_Taiwan
native-country_Thailand
native-country_Trinadad&Tobago
native-country_United-States
native-country_Vietnam
native-country_Yugoslavia
In [8]:
cp = explainer.predict_profile(new_observation = X_test.iloc[[100]])
Calculating ceteris paribus: 100%|██████████| 100/100 [00:04<00:00, 22.92it/s]
/usr/local/lib/python3.10/dist-packages/dalex/predict_explanations/_ceteris_paribus/utils.py:30: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  profiles.loc[:, '_label_'] = explainer.label
In [29]:
cp.plot(variables = ["age", "capital-loss"])
In [33]:
cp2 = explainer.predict_profile(new_observation = X_test.iloc[[2]])
cp2.plot(variables = ["age", "capital-loss"])
Calculating ceteris paribus: 100%|██████████| 100/100 [00:03<00:00, 25.04it/s]
In [14]:
pdp = explainer.model_profile(grid_points = 50)
Calculating ceteris paribus: 100%|██████████| 100/100 [00:09<00:00, 10.70it/s]
In [15]:
pdp.plot(variables=["age"], geom="profiles", title="Partial Dependence Plot with individual profiles")
In [37]:
explainer2 = dx.Explainer(linear_model, X_test, target_test, predict_func)
pdp2 = explainer2.model_profile(grid_points = 50)
Preparation of a new explainer is initiated

  -> data              : 9769 rows 100 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9769 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function predict_func at 0x7fde2ee6cb80> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 8.1e-06, mean = 0.238, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.959, mean = 0.00153, max = 1.0
  -> model_info        : package sklearn

A new explainer has been created!
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:439: UserWarning:

X does not have valid feature names, but StandardScaler was fitted with feature names

Calculating ceteris paribus: 100%|██████████| 100/100 [00:07<00:00, 14.11it/s]
In [43]:
pdp2.plot(variables=["age"], geom="profiles", title="Partial Dependence Plot with individual profiles")